{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "<center>\n",
    "<img src=\"https://raw.githubusercontent.com/Yorko/mlcourse.ai/master/img/ods_stickers.jpg\" />\n",
    "\n",
    "<h1>Open Machine Learning Course</h1>\n",
    "\n",
    "<center>\n",
    "Auteur: [Yury Kashnitskiy](https://yorko.github.io).  \n",
    "Traduit et édité par [Ousmane Cissé](https://www.linkedin.com/in/ousmane-ciss%C3%A9/)  \n",
    "Ce matériel est soumis aux termes et conditions de la licence [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).  \n",
    "L'utilisation gratuite est autorisée à des fins non commerciales."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "# <center>Topic 7. Apprentissage non supervisé\n",
    "## <center>Part 2. Clustering. k-Means\n",
    "    \n",
    "**Il s'agit principalement de démontrer certaines applications du k-Means, pour la théorie, l'étude [sujet 7](https://mlcourse.ai/notebooks/blob/master/jupyter_english/topic07_unsupervised/topic7_pca_clustering.ipynb?flush_cache=true ) dans notre cours**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "## Regroupement (clustering) des joueurs NBA"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Quelques <a href=\"http://www.databasebasketball.com/about/aboutstats.htm\">infos</a> sur les variables des joueurs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "nba = pd.read_csv(\"../../data/nba_2013.csv\")\n",
    "nba.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "kmeans = KMeans(n_clusters=5, random_state=1)\n",
    "numeric_cols = nba._get_numeric_data().dropna(axis=1)\n",
    "kmeans.fit(numeric_cols)\n",
    "\n",
    "\n",
    "# Visualizing using PCA\n",
    "pca = PCA(n_components=2)\n",
    "res = pca.fit_transform(numeric_cols)\n",
    "plt.figure(figsize=(12,8))\n",
    "plt.scatter(res[:,0], res[:,1], c=kmeans.labels_, s=50, cmap='viridis')\n",
    "plt.title('PCA')\n",
    "\n",
    "# Visualizing using 2 features: Total points vs. Total assists\n",
    "plt.figure(figsize=(12,8))\n",
    "plt.scatter(nba['pts'], nba['ast'], \n",
    "            c=kmeans.labels_, s=50, cmap='viridis')\n",
    "plt.xlabel('Total points')\n",
    "plt.ylabel('Total assitances')\n",
    "plt.title('Some interpretation');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "## Compression d'images avec k-Means\n",
    "*pas une technique populaire*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.image as mpimg\n",
    "\n",
    "img = mpimg.imread('../../img/woman.jpg')[..., 1]\n",
    "plt.figure(figsize = (20, 12))\n",
    "plt.axis('off')\n",
    "plt.imshow(img, cmap='gray');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import randint\n",
    "from sklearn.cluster import MiniBatchKMeans\n",
    "\n",
    "X = img.reshape((-1, 1))\n",
    "k_means = MiniBatchKMeans(n_clusters=3)\n",
    "k_means.fit(X) \n",
    "values = k_means.cluster_centers_\n",
    "labels = k_means.labels_\n",
    "img_compressed = values[labels].reshape(img.shape)\n",
    "plt.figure(figsize = (20, 12))\n",
    "plt.axis('off')\n",
    "plt.imshow(img_compressed, cmap = 'gray');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "# Recherche de sujets latents dans les textes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Nous appliquerons k-Means aux groupes de textes de 4 catégories.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from time import time\n",
    "\n",
    "from sklearn import metrics\n",
    "from sklearn.datasets import fetch_20newsgroups\n",
    "from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer\n",
    "from sklearn.preprocessing import Normalizer\n",
    "\n",
    "categories = [\n",
    "    'alt.atheism',\n",
    "    'talk.religion.misc',\n",
    "    'comp.graphics',\n",
    "    'sci.space']\n",
    "\n",
    "print(\"Loading 20 newsgroups dataset for categories:\")\n",
    "print(categories)\n",
    "\n",
    "dataset = fetch_20newsgroups(subset='all', categories=categories,\n",
    "                             shuffle=True, random_state=42)\n",
    "\n",
    "print(\"%d documents\" % len(dataset.data))\n",
    "print(\"%d categories\" % len(dataset.target_names))\n",
    "\n",
    "labels = dataset.target\n",
    "true_k = np.unique(labels).shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Créer des variables Tf-Idf pour les textes**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Extracting features from the training dataset using a sparse vectorizer\")\n",
    "vectorizer = TfidfVectorizer(max_df=0.5, max_features=1000,\n",
    "                             min_df=2, stop_words='english')\n",
    "\n",
    "X = vectorizer.fit_transform(dataset.data)\n",
    "print(\"n_samples: %d, n_features: %d\" % X.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Appliquez k-Means aux vecteurs que nous avons. Calculez également les métriques de clustering.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "km = KMeans(n_clusters=true_k, init='k-means++', \n",
    "            max_iter=100, n_init=1)\n",
    "\n",
    "print(\"Clustering sparse data with %s\" % km)\n",
    "t0 = time()\n",
    "km.fit(X)\n",
    "\n",
    "print(\"Homogeneity: %0.3f\" % metrics.homogeneity_score(labels, km.labels_))\n",
    "print(\"Completeness: %0.3f\" % metrics.completeness_score(labels, km.labels_))\n",
    "print(\"V-measure: %0.3f\" % metrics.v_measure_score(labels, km.labels_))\n",
    "print(\"Adjusted Rand-Index: %.3f\"\n",
    "      % metrics.adjusted_rand_score(labels, km.labels_))\n",
    "print(\"Silhouette Coefficient: %0.3f\"\n",
    "      % metrics.silhouette_score(X, km.labels_, sample_size=1000))\n",
    "\n",
    "order_centroids = km.cluster_centers_.argsort()[:, ::-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Mots trouvés proches des centres de cluster**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "terms = vectorizer.get_feature_names()\n",
    "for i in range(true_k):\n",
    "    print(\"Cluster %d:\" % (i + 1), end='')\n",
    "    for ind in order_centroids[i, :10]:\n",
    "        print(' %s' % terms[ind], end='')\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "## Regroupement des chiffres manuscrits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_digits\n",
    "\n",
    "digits = load_digits()\n",
    "\n",
    "X, y = digits.data, digits.target"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kmeans = KMeans(n_clusters=10)\n",
    "kmeans.fit(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import adjusted_rand_score\n",
    "\n",
    "adjusted_rand_score(y, kmeans.predict(X))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_, axes = plt.subplots(2, 5)\n",
    "for ax, center in zip(axes.ravel(), kmeans.cluster_centers_):\n",
    "    ax.matshow(center.reshape(8, 8), cmap=plt.cm.gray)\n",
    "    ax.set_xticks(())\n",
    "    ax.set_yticks(())"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  },
  "name": "lesson8_part1_kmeans.ipynb",
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
